
    Collecting, Analyzing and Predicting Socially-Driven Image Interestingness

    Interestingness has recently become an emerging concept for visual content assessment. However, understanding and predicting image interestingness remains challenging, as its judgment is highly subjective and usually context-dependent. In addition, existing datasets are too small for in-depth analysis. To push forward research on this topic, this paper describes a large-scale interestingness dataset (images and their associated metadata) and releases it for public use. We then propose computational models based on deep learning to predict image interestingness. We show that exploiting relevant contextual information derived from social metadata can greatly improve the prediction results. Finally, we discuss some key findings and potential research directions for this emerging topic.
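    To make the fusion idea concrete, here is a minimal sketch of combining deep visual features with social metadata to predict interestingness scores. The feature dimensions, metadata fields, and ridge regressor are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical inputs: pooled CNN image embeddings and per-image social
# metadata (e.g., log-scaled view, comment, and favorite counts).
visual_feats = rng.normal(size=(500, 2048))
social_feats = rng.normal(size=(500, 3))
scores = rng.uniform(size=500)   # interestingness scores in [0, 1]

# Early fusion: concatenate the two modalities, then fit a simple regressor.
X = np.hstack([visual_feats, social_feats])
model = Ridge(alpha=1.0).fit(X[:400], scores[:400])
print("predicted interestingness:", model.predict(X[400:405]))
```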

    On SVM-Based Detection of Violent Audio Events in Movies

    This article studies the behaviour of a state-of-the-art support vector machine audio event detection approach, applied to violent event detection in movies. The events we are trying to detect are screams, gunshots and explosions. Contrary to other studies, we show that the state-of-the-art approach does not lead to good results on this task. A study of the repartition of samples into subsets in a cross-validation protocol helps explain those results and highlights a generalisation problem due to a polymorphism of the considered classes. This polymorphism is demonstrated by computing the divergence between the samples of the test database and those of the training database.
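    The divergence analysis mentioned above can be illustrated with a short sketch: fit a Gaussian to the training and test feature sets and compute a symmetrised KL divergence between them. The Gaussian modelling and feature shapes are assumptions for illustration, not the article's exact protocol.

```python
import numpy as np

def gaussian_kl(mu_p, cov_p, mu_q, cov_q):
    """KL(N_p || N_q) for full-covariance Gaussians."""
    d = mu_p.shape[0]
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    return 0.5 * (np.trace(cov_q_inv @ cov_p)
                  + diff @ cov_q_inv @ diff
                  - d
                  + np.log(np.linalg.det(cov_q) / np.linalg.det(cov_p)))

def symmetric_kl(train_feats, test_feats):
    """Symmetrised KL between Gaussians fitted to each feature set."""
    mu_a, cov_a = train_feats.mean(0), np.cov(train_feats, rowvar=False)
    mu_b, cov_b = test_feats.mean(0), np.cov(test_feats, rowvar=False)
    return gaussian_kl(mu_a, cov_a, mu_b, cov_b) + gaussian_kl(mu_b, cov_b, mu_a, cov_a)

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(1000, 13))  # e.g., MFCCs from training movies
test = rng.normal(0.5, 1.5, size=(1000, 13))   # e.g., MFCCs from a held-out movie
print("symmetric KL:", symmetric_kl(train, test))
```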

    Audio Event Detection in Movies using Multiple Audio Words and Contextual Bayesian Networks

    This article investigates a novel use of the well-known audio-words representation to detect specific audio events, namely gunshots and explosions, in order to gain robustness to soundtrack variability in Hollywood movies. An audio stream is processed as a sequence of stationary segments. Each segment is described by one or several audio words obtained by applying product quantisation to standard features. Such a representation using multiple audio words constructed via product quantisation is one of the novelties described in this work. Based on this representation, Bayesian networks are used to exploit contextual information in order to detect audio events. Experiments are performed on a comprehensive set of 15 movies, made publicly available. Results are comparable to the state of the art obtained on the same dataset while showing increased robustness to decision thresholds, at the cost of limiting the range of possible operating points in some conditions. Late fusion provides a solution to this issue.
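    As a rough illustration of how product quantisation yields several audio words per segment, the sketch below splits each feature vector into sub-vectors and quantises each block with its own k-means codebook. The codebook size, number of sub-vectors, and feature dimensions are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(2000, 12))   # e.g., 12-dim descriptors per segment

n_subvectors, codebook_size = 3, 64
dim = features.shape[1] // n_subvectors

# Train one codebook per sub-vector block.
codebooks = [
    KMeans(n_clusters=codebook_size, n_init=4, random_state=0)
    .fit(features[:, i * dim:(i + 1) * dim])
    for i in range(n_subvectors)
]

def audio_words(segment_feat):
    """Return one codeword index ("audio word") per sub-vector block."""
    return [
        int(cb.predict(segment_feat[i * dim:(i + 1) * dim].reshape(1, -1))[0])
        for i, cb in enumerate(codebooks)
    ]

print("audio words for one segment:", audio_words(features[0]))
```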

    Annotating, Understanding, and Predicting Long-term Video Memorability

    Memorability can be regarded as a useful metric of video importance that helps choose between competing videos. Research on the computational understanding of video memorability is, however, in its early stages. No dataset is available for modelling purposes, and the few previous attempts provided protocols for collecting video memorability data that would be difficult to generalize. Furthermore, the computational features needed to build a robust memorability predictor remain largely undiscovered. In this article, we propose a new protocol to collect long-term video memorability annotations. We measure the memory performance of 104 participants from weeks to years after memorization to build a dataset of 660 videos for video memorability prediction. This dataset is made available to the research community. We then analyze the collected data to better understand video memorability, in particular the effects of response time, duration of memory retention and repetition of visualization on video memorability. We finally investigate the use of various types of audio and visual features and build a computational model for video memorability prediction. We conclude that high-level visual semantics help better predict the memorability of videos.
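    A minimal sketch of such a predictor follows, assuming precomputed high-level semantic features per video and a support vector regressor; the regressor choice and feature dimensions are illustrative, not the article's exact pipeline.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.svm import SVR

rng = np.random.default_rng(0)
semantic_feats = rng.normal(size=(660, 512))   # hypothetical per-video features
memorability = rng.uniform(size=660)           # hypothetical annotation scores

# Train on most of the dataset, evaluate on the remainder.
train, test = slice(0, 500), slice(500, 660)
model = SVR(kernel="rbf").fit(semantic_feats[train], memorability[train])
pred = model.predict(semantic_feats[test])

# Rank correlation is a common figure of merit for memorability prediction.
print("Spearman rho:", spearmanr(pred, memorability[test]).correlation)
```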

    On Event Indexing in Movies (Application to Violence Detection)

    In this thesis, we focus on the detection of semantic concepts in "Hollywood" movies using audio and video concepts, in the applicative context of violence detection. We present experiments in two main areas: the detection of violent audio concepts, such as gunshots and explosions, and the detection of violence, initially based only on audio, then based on both audio and video. In the context of audio concept detection, we first highlight a generalisation problem and show that it is probably due to a statistical divergence between the audio features extracted from the movies. To solve it, we propose to use the concept of audio words, so as to reduce this variability by grouping samples by similarity, combined with contextual Bayesian networks. The results are very encouraging, and a comparison with the state of the art obtained on the same data shows that our results are equivalent. The resulting system can either be very robust to the applied decision threshold, by using early fusion of the features, or offer a wide variety of operating points. We finally propose an adaptation of the factor analysis scheme developed in the context of speaker recognition, and show that its integration into our system improves the results. In the context of violence detection, we present the MediaEval Affect Task 2012 evaluation campaign, which aims at bringing together teams working on the topic of violence detection. We then propose three systems for detecting violence. The first two are based only on audio: the first uses a TF-IDF description, and the second integrates the audio concept detection system into the violence detection framework. The third is a multimodal system based on Bayesian networks that allows us to explore structure learning algorithms for graphs. The performance of the different systems, and a comparison with the systems developed within MediaEval, show that we are at the level of the state of the art, and reveal the complexity of such systems.
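    The audio TF-IDF idea can be sketched as follows: treat each segment's audio-word indices as terms, weight them with TF-IDF, and train a linear classifier. The vocabulary size, token format, and classifier are illustrative assumptions, not the thesis's exact setup.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Hypothetical corpus: each document is one segment's audio-word sequence,
# rendered as space-separated tokens (e.g., "w17 w3 w42 ...").
segments = [" ".join(f"w{w}" for w in rng.integers(0, 64, size=20))
            for _ in range(300)]
labels = rng.integers(0, 2, size=300)   # 1 = violent, 0 = non-violent

# TF-IDF weighting over audio-word "terms", then a linear classifier.
X = TfidfVectorizer().fit_transform(segments)
clf = LinearSVC().fit(X[:250], labels[:250])
print("predicted labels:", clf.predict(X[250:255]))
```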

    Multimodal Information Fusion and Temporal Integration for Violence Detection in Movies

    This paper presents a violent shot detection system and studies several methods for introducing temporal and multimodal information into the framework. It also investigates different kinds of Bayesian network structure learning algorithms for modelling these problems. The system is trained and tested using the MediaEval 2011 Affect Task corpus, which comprises 15 Hollywood movies. It is experimentally shown that both multimodality and temporality add useful information to the system. Moreover, the analysis of the links between the variables of the resulting graphs yields important observations about the quality of the structure learning algorithms. Overall, our best system achieved a 50% false alarm rate and a 3% missed detection rate, which is among the best submissions in the MediaEval campaign.
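    For reference, the two error rates quoted above can be computed as follows: a false alarm is a non-violent shot flagged as violent, and a missed detection is a violent shot left unflagged. The toy labels are illustrative, not MediaEval data.

```python
import numpy as np

def error_rates(y_true, y_pred):
    """Return (false alarm rate, missed detection rate) for binary labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    negatives, positives = (y_true == 0), (y_true == 1)
    false_alarms = np.mean(y_pred[negatives] == 1) if negatives.any() else 0.0
    misses = np.mean(y_pred[positives] == 0) if positives.any() else 0.0
    return false_alarms, misses

y_true = [0, 0, 1, 1, 0, 1, 0, 0]   # 1 = violent shot
y_pred = [1, 0, 1, 1, 1, 0, 0, 1]   # system decisions
fa, md = error_rates(y_true, y_pred)
print(f"false alarm rate: {fa:.0%}, missed detection rate: {md:.0%}")
```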

    Technicolor and INRIA/IRISA at MediaEval 2011: learning temporal modality integration with Bayesian Networks

    This paper presents the work done at Technicolor and INRIA on the Affect Task at MediaEval 2011, which aims at detecting violent shots in movies. We studied a Bayesian network framework and several ways of introducing temporality and multimodality into it.

    MediaEval 2018: Predicting Media Memorability Task

    In this paper, we present the Predicting Media Memorability task, which is proposed as part of the MediaEval 2018 Benchmarking Initiative for Multimedia Evaluation. Participants are expected to design systems that automatically predict memorability scores for videos, which reflect the probability of a video being remembered. In contrast to previous work on image memorability prediction, where memorability was measured a few minutes after memorization, the proposed dataset comes with both short-term and long-term memorability annotations. All task characteristics are described, namely: the task's challenges and expected breakthroughs, the released dataset and ground truth, the required participant runs, and the evaluation metrics.
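    As an illustration of how submissions to such a task might be scored, the sketch below computes a rank correlation between predicted and ground-truth memorability scores. Spearman's rho is assumed here purely for illustration; the authoritative metrics are those defined in the task overview.

```python
from scipy.stats import spearmanr

ground_truth = [0.91, 0.72, 0.85, 0.64, 0.78]   # hypothetical memorability scores
predictions = [0.88, 0.70, 0.80, 0.72, 0.75]    # hypothetical system outputs

# Rank correlation rewards getting the ordering of videos right,
# not the exact score values.
rho = spearmanr(ground_truth, predictions).correlation
print(f"Spearman's rho: {rho:.3f}")
```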